Cross-Validation - Methods & Importance in Data Science and Statistics

Jesse Buntyn, Jolie Wise, Danielle Koche

2024-04-16

Introduction

Cross-validation is a cornerstone methodology in the field of data science, essential for assessing the predictive performance of statistical models and ensuring their generalizability to unseen data. This resampling technique allows researchers to evaluate how models will perform in practice, addressing critical challenges like overfitting and underfitting, thereby ensuring robustness and reliability of model predictions across various data subsets.

Cross-Validation Steps

  1. Split the data into a training set and test set, typically 80/20 or 70/30.
  2. Fit a model to the training set and obtain the model parameters.
  3. Apply the fitted model to the test set and obtain prediction accuracy.
  4. Repeat steps one through three.
  5. Calculate the average cross-validation prediction accuracy across all repetitions.

(Song, Tang, and Wee 2021)
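
As a sketch of how these steps translate to code, the following minimal R loop assumes a hypothetical data frame df with a numeric response column y; the 80/20 split and ten repetitions are illustrative choices.

# Generic cross-validation loop following the five steps above.
set.seed(1)
reps   <- 10
cv_err <- numeric(reps)
for (r in seq_len(reps)) {
  idx    <- sample(nrow(df), size = floor(0.8 * nrow(df)))  # step 1: 80/20 split
  train  <- df[idx, ]
  test   <- df[-idx, ]
  fit    <- lm(y ~ ., data = train)                         # step 2: fit on the training set
  pred   <- predict(fit, newdata = test)                    # step 3: predict on the test set
  cv_err[r] <- mean((test$y - pred)^2)                      # step 3: prediction error
}                                                           # step 4: repeat steps 1-3
mean(cv_err)                                                # step 5: average across repetitions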

Cross-Validation Techniques

  • Single Hold-Out Method
  • Monte Carlo Cross-Validation
  • K-Fold Cross-Validation
  • Leave-One-Out Cross-Validation (LOOCV)

Single Hold-Out Method:

This is one of the simpler methods of cross-validation and tends to be less time-consuming to conduct (Yadav and Shukla 2016). It divides the data into a training set and a test set, usually in a 70-30 or 90-10 split.

Monte Carlo Cross-Validation:

This method splits the data sets by drawing random samples without replacement from specified probability distributions (James 1980). It aims to decrease overfitting by allowing the algorithm to better explore the experimental space.

K-Fold Cross-Validation:

The dataset is split into k equally sized folds. Each fold is used once as a test set while the rest serve as the training set.

Leave-One-Out Cross-Validation (LOOCV):

This is a special case of k-fold cross-validation where k equals the number of data points (Shao and Er 2016). Each data point is used once as a test set; in other words, LOOCV is equivalent to setting k = n in k-fold cross-validation, where n is the sample size.

Literature Review - (Zhang and Yang 2015)

Zhang and Yang (2015) focus on cross-validation as a model selection method.

  • Cross-validation techniques can be used for model selection in high-dimensional settings
  • Cross-validation can be used to choose among competing model selection criteria (e.g., the Akaike Information Criterion (AIC) vs. the Bayesian Information Criterion (BIC))
  • Highlight the importance of selecting an ideal data-splitting ratio for the dataset
    • Parametric vs Nonparametric Data

Literature Review - (Wong and Yeh 2019) & (Marcot and Hanea 2021)

Wong and Yeh (2019) examine the reliability of accuracy estimates derived from k-fold cross-validation.

  • The correlation between replications of k-folds should be considered before increasing the number of replications.
  • Experimentation across 20 datasets
  • It is recommended to use a larger number of folds with fewer replications
    • 10-Fold Cross-Validation was found to be ideal

Marcot and Hanea (2021) confirmed that 10 folds is the standard, but 5 folds can also work in some cases.

Literature Review - (Filzmoser, Liebmann, and Varmuza 2009)

Filzmoser et al. (2009) explored repeated double cross-validation (rdCV).

  • The goals of rdCV are as follows:
    • Improve predictive accuracy of complex linear models
    • Provide more accurate prediction error estimates for future models
  • The procedure for rdCV is to:
    • Split the data into testing and training subsets
    • Perform another round of cross-validation on the training data
    • Repeat the process multiple times

Literature Review - (Xu et al. 2018)

Xu et al. (2018) introduce Representative Splitting Cross-Validation (RSCV).

  • RSCV aims to enhance representativeness and uniformity in data sets
  • The DUPLEX algorithm is employed to split the data into training and testing sets and uniformly distribute the farthest data
  • K-Fold cross-validation is then performed on the split data
  • RSCV was only found to be significantly different from traditional cross-validation techniques in one experiment.

Literature Review - (Rabinowicz and Rosset 2022)

Rabinowicz and Rosset (2022) propose a bias-correction measure, CVc, to explore the influence of correlation structure on cross-validation effectiveness.

  • Correction is unnecessary when the distribution of the data does not differ between the training and testing sets.
  • The performance of CVc was compared to that of basic cross-validation on two datasets: one with a clustered correlation structure, and one with a hierarchical spatial correlation structure.
  • CVc more accurately estimated generalization error, remaining unbiased as long as the covariance structure was specified.

Literature Review - (Lei 2020)

Lei (2020) introduces Cross-Validation with Confidence (CVC) to try to reduce overfitting.

  • CVC uses the predicted risk of candidate models and their residuals to fit models.
    • This method can be used for model fitting as well as parameter tuning.
  • CVC can be computationally complex, but Lei claims the running time is comparable to other cross-validation methods.
  • When compared to standard cross-validation, CVC was able to produce an adequate predictive model with fewer predictors and better interpretability.
  • CVC had not been considered for unsupervised methods or binary outcome variables at the time of publication.

Methods - CV Techniques

Single Hold-Out

Single hold-out cross-validation is one of the simpler versions of cross-validation in use (Yadav and Shukla 2016).

  1. Data gets divided into a training set and a validation set
  2. Fit a model using the training data
  3. Use the model on the validation set to create predicted outcome values
  4. Compare the predicted values to the actual outcome values to determine model performance
  • Can produce unreliable models when the training and validation data are distributed differently (Santos et al. 2023).

Methods - CV Techniques

K-Fold

  1. Randomly and evenly split the dataset into k-folds.

  2. Use k-1 folds as the training set to fit the model.

  3. Predict the response variable in the hold-out (kth) fold using the fitted model.

  4. Calculate the prediction error for the response variable in the hold-out fold.

  5. Repeat steps 2-4 k times, using each fold as a hold-out once.

  6. Calculate the overall test error by averaging all k fold-level prediction errors.

Figure 1: Basic K-Fold Cross-Validation. Source: (Ranjan 2021)

The MSE calculation using K-folds is as follows (Jung and Hu 2015):

\[ CV(k) = \frac{1}{k} \sum_{i=1}^{k} MSE_i \tag{1} \]
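
As a sketch of steps 1-6 and Equation 1, a hand-rolled k-fold loop in R, assuming a hypothetical data frame df with a numeric response y; k = 10 is illustrative.

# Manual k-fold cross-validation computing Equation 1.
set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(df)))   # step 1: random, roughly equal folds
mse   <- numeric(k)
for (i in 1:k) {
  fit    <- lm(y ~ ., data = df[folds != i, ])     # step 2: fit on the k-1 training folds
  test   <- df[folds == i, ]                       # the hold-out (ith) fold
  pred   <- predict(fit, newdata = test)           # step 3: predict the hold-out fold
  mse[i] <- mean((test$y - pred)^2)                # step 4: fold-level prediction error
}
cv_k <- mean(mse)                                  # steps 5-6: CV(k) = (1/k) * sum(MSE_i)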

Methods - CV Techniques

Leave-One-Out

  • LOOCV is a variation of k-fold cross-validation
  • The validation set is comprised of a single observation, so k is equal to the sample size (Shao and Er 2016).
  • This technique can be ideal with smaller sample sizes, as the lack of randomization when dividing the dataset creates constant accuracy estimates (Wong 2015)
  • It can be computationally expensive for large datasets due to the high number of folds, as the sketch below illustrates.
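
A minimal sketch of the mechanics (and the cost), treating LOOCV as the k = n special case of the k-fold loop above, under the same hypothetical data frame df with response y:

# Leave-one-out cross-validation: n model fits, one per observation.
n    <- nrow(df)
errs <- numeric(n)
for (i in 1:n) {
  fit     <- lm(y ~ ., data = df[-i, ])            # fit on all but observation i
  errs[i] <- (df$y[i] - predict(fit, df[i, ]))^2   # test on the single held-out point
}
mean(errs)  # LOOCV estimate; the n separate fits are what make it expensive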

Methods - CV Techniques

Monte Carlo

MCCV randomly samples significant chunks of the dataset without replacement to form its training and validation (\(s_v\)) data sets (Haddad et al. 2013). This sampling is repeated N times. The criterion for this technique, according to Haddad et al. (2013), is as follows:

\[ MCCV_{n_v}(\phi)=\frac{1}{Nn_v}\sum_{j=1}^{N}\sum_{i=1}^{n_v}\left(y_{(s_v)(i)}-\hat{y}_{\phi(s_v)(i)}\right)^2 \tag{2} \]

  • Observations may appear in the validation set multiple times across splits, or not at all
  • Lower computational complexity than some other techniques (e.g., LOOCV) (James 1980); a sketch of the criterion follows below
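
A minimal sketch of Equation 2, under the same hypothetical data frame df with response y; the number of repetitions N and validation size n_v are illustrative choices.

# Monte Carlo cross-validation following Equation 2.
set.seed(1)
N   <- 100
n_v <- floor(0.3 * nrow(df))                   # size of each validation set s_v
sq  <- numeric(N)
for (j in 1:N) {
  s_v   <- sample(nrow(df), n_v)               # sample s_v without replacement
  fit   <- lm(y ~ ., data = df[-s_v, ])        # fit on the remaining training set
  pred  <- predict(fit, newdata = df[s_v, ])
  sq[j] <- sum((df$y[s_v] - pred)^2)           # squared errors for this split
}
mccv <- sum(sq) / (N * n_v)                    # Equation 2: average over all N * n_v predictions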

Performance Metrics - MSE

The MSE is the expectation of the squared norm of the difference between the estimated and the true parameter vectors, and can be calculated as:

\[ MSE = \sum_{i=1}^{p} (\hat{\beta}_i - \beta_i)^2 \tag{3} \]

where \(\hat{\beta}_i\) is the estimated value of the \(i\)-th parameter, \(\beta_i\) is the true value of the \(i\)-th parameter, and \(p\) is the total number of parameters (Jung and Hu 2015).

  • The lower the MSE, the better the model’s predictions match the actual data.
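
As a small worked instance of Equation 3, with hypothetical true and estimated parameter vectors:

# Worked instance of Equation 3 (p = 3 parameters, values assumed for illustration).
beta     <- c(1.5, -0.8, 2.0)    # true parameter vector
beta_hat <- c(1.4, -0.7, 2.3)    # estimated parameter vector
sum((beta_hat - beta)^2)         # (-0.1)^2 + (0.1)^2 + (0.3)^2 = 0.11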

Performance Metrics - AUC

Another performance metric that will be used to determine model performance is the area under the ROC curve (AUC).

  • AUC conveys how well a model can differentiate between different classes, and it is preferable to other metrics, such as accuracy (Cantor and Kattan 2000).
  • AUC scores can range from 0 to 1

Equation 4 demonstrates how AUC uses the positive and negative outcomes to calculate the probability that a positive result ranks higher than a negative one, where \(S_p\) is the sum of the ranks of the positive outcomes, and \(n_p\) and \(n_n\) are the numbers of positive and negative outcomes, respectively (Hossin and Sulaiman 2015).

\[ AUC = \frac{S_p-n_p(n_p+1)/2}{n_pn_n} \tag{4} \]
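
A rank-based R sketch of Equation 4; the label vector obs and score vector pred are hypothetical names.

# Rank-based computation of Equation 4.
auc_rank <- function(obs, pred) {
  r   <- rank(pred)                 # rank all scores together
  n_p <- sum(obs == 1)
  n_n <- sum(obs == 0)
  S_p <- sum(r[obs == 1])           # sum of ranks of the positive outcomes
  (S_p - n_p * (n_p + 1) / 2) / (n_p * n_n)
}
auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75: one of four pairs is mis-ranked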

Figure 2: ROC AUC Curve. Source: (“How to Explain the ROC Curve and ROC AUC Score?” 2024)

Analysis and Results

Data Description & Features

  • The dataset contains 13 clinical, body, and lifestyle features from 299 heart failure patients in Faisalabad, Pakistan in 2015
  • The goal was to demonstrate potential biomarkers in predicting heart failure outcomes
  • The study focused solely on serum creatinine and ejection fraction as predictors of patient survival
Variable                  Type       Characteristic
age                       Predictor  Integer
anaemia                   Predictor  Binary (1 = anemic, 0 = not)
creatinine_phosphokinase  Predictor  Integer
diabetes                  Predictor  Binary (1 = diabetic, 0 = not)
ejection_fraction         Predictor  Integer
high_blood_pressure       Predictor  Binary (1 = HBP present, 0 = no HBP)
platelets                 Predictor  Continuous
serum_creatinine          Predictor  Continuous
serum_sodium              Predictor  Integer
sex                       Predictor  Binary (1 = male, 0 = female)
smoking                   Predictor  Binary (1 = smoker, 0 = not a smoker)
time                      Predictor  Integer
DEATH_EVENT               Response   Binary (1 = death, 0 = no death)

Continuous Variables

  • The age of patients ranged from 40 to 95 years, with the majority aged below 70.
  • The time variable is the number of days in the follow-up period for each patient, ranging from 4 to 285 days.
  • Creatinine phosphokinase indicates the level of this enzyme in the patient’s blood, which is present when muscle tissue is damaged (Chicco and Jurman 2020).
  • Serum creatinine is a waste product generated when muscle breaks down (Chicco and Jurman 2020).

Continuous Variables (cont.)

  • Variables were included that observe the levels of sodium (serum sodium) and number of platelets in the patient’s blood.
  • The ejection fraction variable is the percentage of blood leaving the left ventricle during each contraction of the heart (Chicco and Jurman 2020).
  • These features look fairly normally distributed, with some potential skewing in the platelet and serum sodium features.

Categorical Variables

  • All of our categorical variables are Boolean and are encoded as ‘0’ and ‘1’.
  • These variables include whether the patient is a smoker, has diabetes, anaemia, or high blood pressure.
  • The sex variable indicates whether the patient is a female (‘0’) or male (‘1’).
  • Our outcome variable, death event, indicates whether the patient died before the end of the follow-up period.

Missing Values

                     age                  anaemia creatinine_phosphokinase 
                       0                        0                        0 
                diabetes        ejection_fraction      high_blood_pressure 
                       0                        0                        0 
               platelets         serum_creatinine             serum_sodium 
                       0                        0                        0 
                     sex                  smoking                     time 
                       0                        0                        0 
             DEATH_EVENT 
                       0 

We counted the missing values for each variable, and none were found. The analysis can continue without deleting rows or imputing values.
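
The counts above can be reproduced by summing the NA indicators in each column; a one-line sketch, assuming the data frame is named data as in the modeling code below:

colSums(is.na(data))  # number of missing values per column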

Checking for Outliers

  • Boxplots for our continuous variables were created in our exploratory analysis to evaluate if any variables have outliers that need to be addressed.
  • However, these data points will remain untouched for our analysis as they appear to be genuine observations and should be considered in our model.

Correlation

Statistical Modeling

Since the outcome variable (death event) is binary, logistic regression will be used with five different cross-validation techniques: 10-fold, 5-fold, single hold-out, leave-one-out, and Monte Carlo cross-validation. Two models will be fit for each technique: one with every predictor variable included (Model 1), and one using only serum creatinine and ejection fraction (Model 2), as advised by Chicco and Jurman (2020). The goal is to compare how the difference in performance between the two models changes as the various cross-validation techniques are applied. Model performance will be assessed using MSE and AUC.

10-fold Cross Validation

Model 1

With any variation of k-fold cross-validation, the dataset is split into k equally sized folds. The function used for our cross-validation was cv.glm, which requires only the data, the glm model, and the number of folds. Here, the dataset is split into 10 folds; as previously discussed, Wong and Yeh (2019) found 10-fold cross-validation to be most effective in most cases. Our glm model is a logistic rather than a linear regression, since our outcome variable is binary.

10-fold Cross Validation

Model 1

In model 1, all predictor variables are included. The default metric for cv.glm is MSE, so a cost function was created to calculate the AUC values for the models. This cost function will be used for both the 10-fold and 5-fold analyses.

10-fold Cross Validation

Model 1 Code

library(boot)  # cv.glm()
library(ROCR)  # prediction() and performance(), used in the AUC cost function

# Cost function: AUC for observed labels and predicted probabilities
auc <- function(obs, pred.p){
  pred <- prediction(pred.p, obs)
  cost <- unlist(slot(performance(pred, "auc"), "y.values"))
  return(cost)
}

#Model 1
set.seed(43269)
m1 <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
         data = data, 
         family = "binomial"(link = "logit"))

k10_m1_mse <- cv.glm(data, m1, K=10)$delta            # default cost: MSE
k10_m1_auc <- cv.glm(data, m1, K=10, cost=auc)$delta  # custom cost function: AUC

10-fold Cross Validation

Model 2

In model 2, only ejection fraction and serum creatinine are included, as the researchers who provided the dataset concluded that death events could be predicted using those two variables alone.

# Model 2
set.seed(85329)
m2 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine,
         family = "binomial"(link = "logit"), data=data)

k10_m2_mse <- cv.glm(data, m2, K=10)$delta
k10_m2_auc <- cv.glm(data=data, glmfit=m2, cost=auc, K=10)$delta

10-fold Cross Validation

Performance

Interestingly, the first model, which included all variables, produced a lower MSE and a higher AUC, indicating that by this CV test model 1 would be preferable to model 2. Model 2's AUC is still acceptable, but the first model's is noticeably better.

# A tibble: 1 × 4
  M1_K10_MSE M2_K10_MSE M1_K10_AUC M2_K10_AUC
       <dbl>      <dbl>      <dbl>      <dbl>
1      0.133      0.182      0.868      0.756

5-fold Cross Validation

Model 1

Similar to the 10-fold models above, our dataset is split into equally sized folds using cv.glm; in this case, we use only 5 folds instead of 10. Again, in model 1, all predictor variables are included.

library(boot)

set.seed(143269)
m1 <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
         data = data, 
         family = "binomial"(link = "logit"))

k5_m1_mse <- cv.glm(data, m1, K=5)$delta
k5_m1_auc <- cv.glm(data, m1, K=5, cost=auc)$delta

5-fold Cross Validation

Model 2

In model 2, only ejection fraction and serum creatinine are included.

set.seed(185329)
m2 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine,
         data = data, 
         family = "binomial"(link = "logit"))

k5_m2_mse <- cv.glm(data, m2, K=5)$delta
k5_m2_auc <- cv.glm(data, m2, K=5, cost=auc)$delta

5-fold Cross Validation

Performance

Model 1 once again performed best based on its MSE and AUC values. Both AUC values are acceptable and both MSEs are fairly low, but model 1's AUC is noticeably better.

# A tibble: 1 × 4
  M1_K5_MSE M2_K5_MSE M1_K5_AUC M2_K5_AUC
      <dbl>     <dbl>     <dbl>     <dbl>
1     0.136     0.180     0.872     0.759

Single Hold-Out Method

Test/Train Split & Model Creation

Using the single hold-out method, the dataset is split into a training set and a test/validation set. Since the cv.glm function applies only to k-fold cross-validation, the splitting of the data is done manually. Here, we use a 50-50 split. The functions for creating each model were identical to those in the previous techniques, using glm() to perform logistic regression.

Single Hold-Out Method

Test/Train Split & Model Code

#add obs # for each row
library(dplyr)
data_SHO <- data %>%
  mutate(obs = row_number()) %>%
  relocate(obs)

#split
library(rsample)
set.seed(726159)
split_SHO <- initial_split(data_SHO, prop = 0.5)
training_SHO <- training(split_SHO)
validation_SHO <- testing(split_SHO)

#create SHO models with training set
m1_SHO <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
         data = training_SHO, 
         family = "binomial"(link = "logit"))

m2_SHO <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine,
         data = training_SHO, 
         family = "binomial"(link = "logit"))

Single Hold-Out Method

MSE & AUC Calculation

The models were applied to the validation set to obtain predicted outcome values, and the mean of the squared errors of the predictions was computed for each model. To calculate the AUC values, we only needed our auc() function with the actual outcome variable and the predicted outcomes for the validation set.

Single Hold-Out Method

MSE & AUC Calculation

#calculate MSE
# note: predict() without type = "response" returns link-scale (log-odds)
# predictions, so these squared errors are computed on the log-odds scale
validation_SHO <- validation_SHO %>% 
  mutate(yhat_m1_SHO = predict(m1_SHO, newdata = validation_SHO),
         yhat_m2_SHO = predict(m2_SHO, newdata = validation_SHO)) %>%
  mutate(sqerr_m1_SHO = (yhat_m1_SHO - DEATH_EVENT)^2,
         sqerr_m2_SHO = (yhat_m2_SHO - DEATH_EVENT)^2) %>%
  relocate(obs, DEATH_EVENT, yhat_m1_SHO, sqerr_m1_SHO, yhat_m2_SHO, sqerr_m2_SHO)

SHO_m1_mse <- mean(validation_SHO$sqerr_m1_SHO, na.rm = TRUE)
SHO_m2_mse <- mean(validation_SHO$sqerr_m2_SHO, na.rm = TRUE)

#Calculate AUC
SHO_m1_auc = auc(validation_SHO$DEATH_EVENT,validation_SHO$yhat_m1_SHO)
SHO_m2_auc = auc(validation_SHO$DEATH_EVENT,validation_SHO$yhat_m2_SHO)

Single Hold-Out Method

Performance

After conducting the analysis, model 2 (m2) provides a substantially lower mean squared error. Both models have very similar AUC values, with the first model only marginally higher. M2 includes only ejection fraction and serum creatinine as predictors, meaning the reduced model is a better fit according to this cross-validation technique. This result aligns with the conclusion of the original article the data was sourced from.

# A tibble: 1 × 4
  M1_SHO_MSE M2_SHO_MSE M1_SHO_AUC M2_SHO_AUC
       <dbl>      <dbl>      <dbl>      <dbl>
1       7.45       1.69      0.837      0.820

Leave-One-Out Cross-Validation (LOOCV)

LOOCV follows a similar data-splitting technique to the previous k-fold examples, but here the data is split k ways, where k equals the number of data points; in this case, k would be 299. Since LOOCV is a form of k-fold cross-validation, cv.glm can be used for the data splitting. The K argument isn't necessary when performing LOOCV, as the function defaults to a number of folds equal to the number of observations. Because each hold-out fold contains only a single observation, the ROC-based AUC cost used earlier is not defined per fold, so a threshold-based classification accuracy cost is used in its place. The same glm() call was used to fit our logistic regression models as in the other k-fold techniques.

LOOCV (cont.)

Once again, the process for both models was identical apart from the number of predictors included in the models.

# Accuracy-based cost: proportion of correct classifications at a 0.5 cutoff,
# used in place of ROC AUC, which is undefined on single-observation folds
cost <- function(labels, pred){
  mean(labels == ifelse(pred > 0.5, 1, 0))
}
# Model 1
set.seed(7432)
m1 <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
         data = data, 
         family = "binomial"(link = "logit"))

loocv_m1_mse <- cv.glm(data, m1)$delta
loocv_m1_auc <- cv.glm(data, m1, cost = cost)$delta
# Model 2
set.seed(20987)
m2 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine,
         data = data, 
         family = "binomial"(link = "logit"))

loocv_m2_mse <- cv.glm(data, m2)$delta
loocv_m2_auc <- cv.glm(data, m2, cost = cost)$delta

Leave-One-Out Cross-Validation (LOOCV)

Performance

According to LOOCV, the full model (M1) is a better fit, with a lower MSE than the reduced model (M2). The AUC values follow a similar pattern, with model 1 scoring higher than model 2.

# A tibble: 1 × 4
  `M1 LOOCV MSE` `M2 LOOCV MSE` `M1 LOOCV AUC` `M2 LOOCV AUC`
           <dbl>          <dbl>          <dbl>          <dbl>
1          0.134          0.182          0.827          0.742

Monte Carlo Cross-Validation

Resampling Function & Model Creation

Unlike the previous techniques, the mc_cv() function from the rsample package is used to repeat the splitting of the data many times. Our proportion of training to validation data is the standard 70/30 split, and the data is resampled 100 times. Workflows were used to specify our model type (logistic regression) and to plug in the variables for each of our models.

Monte Carlo Cross-Validation

Code

library(tidymodels)  # mc_cv(), logistic_reg(), workflow(), fit_resamples()

data$DEATH_EVENT=as.factor(data$DEATH_EVENT)  # classification engines expect a factor outcome
set.seed(30)
resample1 <- mc_cv(data, times = 100, prop = .7)  # 100 random 70/30 splits

mccv_model <- logistic_reg() %>% #creating classification model
  set_engine("glm") %>%
  set_mode("classification")
m1_wflow <- #adding variables for model 1
  workflow() %>%
  add_formula(
    DEATH_EVENT ~ .) %>%
  add_model(mccv_model)
m2_wflow <- # adding variables for model 2
  workflow() %>%
  add_formula(
    DEATH_EVENT ~ ejection_fraction + serum_creatinine) %>%
  add_model(mccv_model)

Monte Carlo Cross-Validation

Resampling & Metric Calculation

The fit_resamples() function applies the resampling object created earlier to our models to perform the cross-validation. For the MSE of the models, we use the Brier score, since fit_resamples() generates it automatically and it plays a role similar to MSE in classification problems (Ferro and Fricker 2012).

Monte Carlo Cross-Validation

Resampling & Metric Calculation Code

keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
set.seed(30)
m1_MCCV <- m1_wflow %>% fit_resamples(resamples = resample1, control = keep_pred)
m1_MCCV_metrics <- collect_metrics(m1_MCCV)

# collect_metrics() returns one row per metric; column 3 holds the mean across resamples
mccv_m1_mse <- as.numeric(m1_MCCV_metrics[2,3])  # Brier score row
mccv_m1_auc <- as.numeric(m1_MCCV_metrics[3,3])  # ROC AUC row

m2_MCCV <- m2_wflow %>% fit_resamples(resamples = resample1, control = keep_pred)
m2_MCCV_metrics <-collect_metrics(m2_MCCV)

mccv_m2_mse <- as.numeric(m2_MCCV_metrics[2,3])
mccv_m2_auc <- as.numeric(m2_MCCV_metrics[3,3])

Monte Carlo Cross-Validation

Performance

Similar to the previously discussed techniques, our full model had the lower MSE/Brier score and the higher AUC. The reduced model's AUC of 0.763 is still acceptable, but the gap of about 0.1 suggests there may be another predictor or two that could contribute significantly to the model.

# A tibble: 1 × 4
  M1_MCCV_MSE M2_MCCV_MSE M1_MCCV_AUC M2_MCCV_AUC
        <dbl>       <dbl>       <dbl>       <dbl>
1       0.139       0.183       0.864       0.763

Comparing Technique Performance

               CV    M1_MSE    M1_AUC    M2_MSE    M2_AUC
1         10-Fold 0.1327269 0.8675932 0.1816103 0.7559824
2          5-Fold 0.1355044 0.8719124 0.1798165 0.7585011
3 Single Hold-Out 7.4504247 0.8374603 1.6855614 0.8203175
4           LOOCV 0.1342105 0.8271048 0.1815950 0.7423519
5            MCCV 0.1393454 0.8644844 0.1831616 0.7633794

The only technique with markedly different results was single hold-out cross-validation. It was the only technique that clearly indicated the reduced model as the superior one. This difference leads us to believe the simplicity of the technique could have caused some overfitting and higher bias; other training/testing split ratios could be considered to combat this. The remaining techniques performed so similarly on all measures that choosing the technique that best fit and predicted our data is difficult.

Conclusion

  • The full model (which includes all predictor variables in the dataset) tends to have lower mean squared error (MSE) and higher area under the curve (AUC) values after testing.
  • This means the full model has more robust predictive ability than the reduced model.
  • The difference in MSEs between the two models is marginal, but the full model does have slightly more predictive ability, along with a better ability to distinguish between classes, i.e., whether or not the patient died within the follow-up period.

Conclusion (cont.)

  • The marginal difference between models supports the original researchers' argument: the two-predictor model and the twelve-predictor model produced only small differences in MSE in the analysis above, and models with fewer predictor variables are typically preferred in statistical analysis.
  • Since the AUC values did show differences between models, the model could perhaps benefit from some additional predictors. However, the reduced models still demonstrated acceptable class discrimination.

Conclusion (cont.)

  • Single hold-out CV stood out as having the worst performance, with MSE values substantially higher than those of the other techniques.
  • The rest of the models showed only marginal differences in performance. However, considering the process of administering the techniques, the k-fold approaches, including leave-one-out, provided the most straightforward approach.
  • Ultimately, most of these techniques seem sufficient to help validate a model based on this example.

References

Cantor, Scott B, and Michael W Kattan. 2000. “Determining the Area Under the ROC Curve for a Binary Diagnostic Test.” Medical Decision Making 20 (4): 468–70.
Chicco, Davide, and Giuseppe Jurman. 2020. “Machine Learning Can Predict Survival of Patients with Heart Failure from Serum Creatinine and Ejection Fraction Alone.” BMC Medical Informatics and Decision Making 20: 1–16.
Ferro, Christopher AT, and Thomas E Fricker. 2012. “A Bias-Corrected Decomposition of the Brier Score.” Quarterly Journal of the Royal Meteorological Society 138 (668): 1954–60.
Filzmoser, Peter, Bettina Liebmann, and Kurt Varmuza. 2009. “Repeated Double Cross Validation.” Journal of Chemometrics: A Journal of the Chemometrics Society 23 (4): 160–71.
Haddad, Khaled, Ataur Rahman, Mohammad A Zaman, and Surendra Shrestha. 2013. “Applicability of Monte Carlo Cross Validation Technique for Model Development and Validation Using Generalised Least Squares Regression.” Journal of Hydrology 482: 119–28.
Hossin, Mohammad, and Md Nasir Sulaiman. 2015. “A Review on Evaluation Metrics for Data Classification Evaluations.” International Journal of Data Mining & Knowledge Management Process 5 (2): 1.
“How to Explain the ROC Curve and ROC AUC Score?” 2024. Evidently AI. https://www.evidentlyai.com/classification-metrics/explain-roc-curve.
James, Frederick. 1980. “Monte Carlo Theory and Practice.” Reports on Progress in Physics 43 (9): 1145.
Jung, Yoonsuh, and Jianhua Hu. 2015. “A K-Fold Averaging Cross-Validation Procedure.” Journal of Nonparametric Statistics 27 (2): 167–79.
Lei, Jing. 2020. “Cross-Validation with Confidence.” Journal of the American Statistical Association 115 (532): 1978–97.
Marcot, Bruce G, and Anca M Hanea. 2021. “What Is an Optimal Value of k in k-Fold Cross-Validation in Discrete Bayesian Network Analysis?” Computational Statistics 36 (3): 2009–31.
Rabinowicz, Assaf, and Saharon Rosset. 2022. “Cross-Validation for Correlated Data.” Journal of the American Statistical Association 117 (538): 718–31.
Ranjan, Gopal Krishna. 2021. “Introduction to k-Fold Cross-Validation in Python.” SQLRelease. https://sqlrelease.com/introduction-to-k-fold-cross-validation-in-python.
Santos, Isadora Guimarães dos, Rogério Pinto Espíndola, Francisco Thiago Sacramento Aragão, and Luis Alberto Herrmann Nascimento. 2023. “A Statistical Methodology for Reliable Evaluation of Calibrated Dynamic Modulus Regression Models.” Matéria (Rio de Janeiro) 28: e20230188.
Shao, Zhifei, and Meng Joo Er. 2016. “Efficient Leave-One-Out Cross-Validation-Based Regularized Extreme Learning Machine.” Neurocomputing 194: 260–70.
Song, Q Chelsea, Chen Tang, and Serena Wee. 2021. “Making Sense of Model Generalizability: A Tutorial on Cross-Validation in R and Shiny.” Advances in Methods and Practices in Psychological Science 4 (1): 2515245920947067.
Wong, Tzu-Tsung. 2015. “Performance Evaluation of Classification Algorithms by k-Fold and Leave-One-Out Cross Validation.” Pattern Recognition 48 (9): 2839–46.
Wong, Tzu-Tsung, and Po-Yang Yeh. 2019. “Reliable Accuracy Estimates from k-Fold Cross Validation.” IEEE Transactions on Knowledge and Data Engineering 32 (8): 1586–94.
Xu, Lu, Ou Hu, Yuwan Guo, Mengqin Zhang, Daowang Lu, Chen-Bo Cai, Shunping Xie, Mohammad Goodarzi, Hai-Yan Fu, and Yuan-Bin She. 2018. “Representative Splitting Cross Validation.” Chemometrics and Intelligent Laboratory Systems 183: 29–35.
Yadav, Sanjay, and Sanyam Shukla. 2016. “Analysis of k-Fold Cross-Validation over Hold-Out Validation on Colossal Datasets for Quality Classification.” In 2016 IEEE 6th International Conference on Advanced Computing (IACC), 78–83. IEEE.
Zhang, Yongli, and Yuhong Yang. 2015. “Cross-Validation for Selecting a Model Selection Procedure.” Journal of Econometrics 187 (1): 95–112.